ASA DataFest 2024 Workshop

Module 02: Data Visualisation

Iris Jiang & Thomas Fung

School of Mathematical and Physical Sciences

Contents

  • Data Visualisation with ggplot2
  • Combine the skills learnt so far.

Acknowledgement

  • The content in this module is heavily inspired by the following work:
    • Emi Tanaka’s Data Visualisation with R workshop.
    • Alison Hill’s presentation: “Plot Twist. 10 Bake Offs, 11 Ways”.

If you have done this workshop/STAT1378 before

  • I recommend you watch this presentation by Alison Hill, “Take a Sad Plot & Make it Better”, instead.

Data Visualisation with ggplot2

Constructing plots with R: base version


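# Example data, reconstructed from the printed output below
df <- tibble::tibble(
  duty = c("Teaching", "Research", "Admin"),
  perc = c(40, 40, 20)
)
df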
# # A tibble: 3 × 2
#   duty      perc
#   <chr>    <dbl>
# 1 Teaching    40
# 2 Research    40
# 3 Admin       20
  • Single purpose functions to generate “named plots”
  • Stacked barplot
barplot(as.matrix(df$perc),
  legend = df$duty
)

  • Pie chart
pie(df$perc, labels = df$duty)

Artwork by @allison_horst

Data Visualisation with ggplot2

  • R has several systems for making graphs, but ggplot2 of Wickham et al. (2022) is one of the most elegant and most versatile.
  • ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
  • The best way to understand the Grammar of Graphics is to see it explained in action:
  • With ggplot2, you can do more, and do it faster, by learning one system and applying it in many places.

Note

  • With ggplot2, you begin a plot with the function ggplot().
    • ggplot() creates a coordinate system that you can add layers to.
    • The first argument of ggplot() is the dataset to use in the graph.
    • On its own, ggplot(data = df) creates an empty graph, so it is not shown here.
  • You complete your graph by adding one or more layers to ggplot().
    • The function geom_dotplot() adds a layer of points to your plot, which creates a dot plot.
  • ggplot2 comes with many geom functions that each add a different type of layer to a plot.
  • Each geom function in ggplot2 takes a mapping argument.
    • This defines how variables in your dataset are mapped to visual properties.
    • The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
    • ggplot2 looks for the mapped variables in the dataset passed to the data argument.
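
As a rough illustration of the points above, the sketch below builds a dot plot from the small df data shown in the base R example. The binwidth value is an assumption chosen for this toy data, and base and dot_plot are names introduced here.

```r
library(ggplot2)

# Toy data reconstructed from the df printout shown earlier
df <- data.frame(
  duty = c("Teaching", "Research", "Admin"),
  perc = c(40, 40, 20)
)

# Step 1: ggplot() alone only sets up an empty plot with no layers
base <- ggplot(data = df)

# Step 2: adding a geom function adds a layer; aes() maps perc to the x-axis
# (binwidth = 5 is an arbitrary choice for this illustration)
dot_plot <- base +
  geom_dotplot(mapping = aes(x = perc), binwidth = 5)

dot_plot
```

Swapping geom_dotplot() for another geom function changes the type of plot while the data and the rest of the code stay the same.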

A graphing template

  • Let’s turn this code into a reusable template for making graphs with ggplot2.
  • To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Palmer penguins

  • penguins data is from the palmerpenguins package of Horst, Hill, and Gorman (2022).

Palmer penguins (cont.)


# pak::pak("palmerpenguins") # if not installed or
# install.packages("palmerpenguins")
library(tidyverse)
library(palmerpenguins)
glimpse(penguins)
# Rows: 344
# Columns: 8
# $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
# $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
# $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
# $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
# $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
# $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
# $ sex               <fct> male, female, female, NA, female, male, female, male…
# $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

  • For example:
library(tidyverse)
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    colour = species
  )
) +
  geom_point()

Aesthetic mappings

Artwork by Emi Tanaka

Hidden argument names in ggplot

Artwork by Emi Tanaka
  • There is no need to write out data =, mapping =, x = and y = explicitly each time in ggplot.
  • ggplot code in the wild often omits these argument names.
  • But the position of each argument needs to be correct if argument names are not specified!

Hidden argument names in ggplot (cont.)

For example:

ggplot(
  penguins,
  aes(species)
)

ggplot(
  penguins,
  aes(species, bill_length_mm)
)
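
To see why position matters, the following sketch (an illustration, not part of the original slides) compares a fully named call with its positional equivalent, and then swaps the first two arguments. The tryCatch() wrapper is only there so the error can be inspected without stopping the script; named, positional and swapped are names introduced here.

```r
library(ggplot2)
library(palmerpenguins)

# Fully named and positional versions describe the same plot
named <- ggplot(data = penguins, mapping = aes(x = species, y = bill_length_mm))
positional <- ggplot(penguins, aes(species, bill_length_mm))

# Swapping the positions passes aes() as the data, which ggplot2 rejects
swapped <- tryCatch(
  ggplot(aes(species, bill_length_mm), penguins),
  error = function(e) conditionMessage(e)
)
swapped # an error message rather than a plot
```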

Different geometric objects:


p <- ggplot(penguins, aes(species, bill_length_mm))

p + geom_boxplot()

p + geom_point()

geom

  • Here is a list of some commonly used geoms:

geom                                   Title
geom_abline, geom_hline, geom_vline    Reference lines: horizontal, vertical, and diagonal
geom_bar, geom_col                     Bar charts
geom_boxplot                           A box and whiskers plot (in the style of Tukey)
geom_density                           Smoothed density estimates
geom_dotplot                           Dot plot
geom_freqpoly, geom_histogram          Histograms and frequency polygons
geom_jitter                            Jittered points
geom_path, geom_line, geom_step        Connect observations
geom_point                             Points
geom_qq_line, geom_qq                  A quantile-quantile plot
geom_smooth                            Smoothed conditional means
geom_label, geom_text                  Text
geom_violin                            Violin plot

Statistical transformation


g <- ggplot(penguins, aes(species, bill_length_mm)) +
  geom_boxplot()
  • Notice that the boxplot does not draw the raw data!
  • It plots a statistical transformation of the y-values: the median, the quartiles and the whiskers.
  • Under the hood, the data are transformed (including converting the factor on the x-axis to numerical values).
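
One way to inspect this transformation is layer_data() from ggplot2, which returns the data a layer actually draws. The sketch below assumes the penguins data loaded earlier; stats is a name introduced here for illustration.

```r
library(ggplot2)
library(palmerpenguins)

g <- ggplot(penguins, aes(species, bill_length_mm)) +
  geom_boxplot()

# Data computed by the boxplot layer: one row of summary statistics per species
# (rows with a missing bill length are dropped with a warning)
stats <- layer_data(g)
stats[, c("x", "ymin", "lower", "middle", "upper", "ymax")]

# The middle line of each box is the median of the raw values
median(penguins$bill_length_mm[penguins$species == "Adelie"], na.rm = TRUE)
```

Note that the x column holds numbers rather than species names, which is the factor-to-numeric conversion mentioned above.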

Add multiple layers

Artwork by Emi Tanaka

Add multiple layers (cont.)

  • Each layer inherits mapping and data from ggplot by default.
ggplot(penguins, 
       aes(x = species, 
           y = bill_length_mm)
       ) +
  geom_violin() +
  geom_boxplot() +
  geom_point()

Order of the layers matters!

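  • Layers are drawn in the order they are added, so later layers sit on top of earlier ones.
  • In the second plot below, the order of the boxplot and violin plot layers is switched around.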
ggplot(
  penguins,
  aes(species, bill_length_mm)
) +
  geom_violin() +
  geom_boxplot() +
  geom_point()

ggplot(
  penguins,
  aes(species, bill_length_mm)
) +
  geom_boxplot() +
  geom_violin() +
  geom_point()
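
As a small check of the drawing order (a sketch only, with layer_order() being a helper name made up here), the geoms of a plot can be listed from the bottom layer to the top layer:

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(species, bill_length_mm))

violin_first <- p + geom_violin() + geom_boxplot() + geom_point()
boxplot_first <- p + geom_boxplot() + geom_violin() + geom_point()

# Hypothetical helper: name of the geom used by each layer, bottom layer first
layer_order <- function(plot) {
  vapply(
    plot$layers,
    function(layer) class(layer$geom)[1],
    character(1),
    USE.NAMES = FALSE
  )
}

layer_order(violin_first)
layer_order(boxplot_first)
```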

Layer-specific data and aesthetic mappings

Artwork by Emi Tanaka

Layer-specific data and aesthetic mappings (cont.)

  • For each layer, the aesthetic mappings and/or the data can be overridden.
ggplot(penguins, aes(species, bill_length_mm)) +
  geom_violin(aes(fill = species)) +
  geom_boxplot(data = filter(penguins, species == "Adelie")) +
  geom_point(
    data = filter(penguins, species == "Gentoo"),
    aes(y = bill_depth_mm)
  )

Aesthetic or Attribute?


ggplot(penguins) +
  geom_point(aes(body_mass_g,
    bill_depth_mm,
    colour = "blue"
  ))

  • The points are not “blue” in colour: inside aes(), "blue" is treated as a data value and mapped through the colour scale.

  • What you really want is to set colour outside aes():
ggplot(penguins) +
  geom_point(aes(
    body_mass_g,
    bill_depth_mm
  ),
  colour = "blue"
  )
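
To check what each version does, layer_data() can be used to look at the colour ggplot2 ends up using for the points. This is a sketch for illustration; mapped and fixed are names introduced here.

```r
library(ggplot2)
library(palmerpenguins)

# "blue" inside aes(): treated as a one-level variable and sent through the colour scale
mapped <- ggplot(penguins) +
  geom_point(aes(body_mass_g, bill_depth_mm, colour = "blue"))

# "blue" outside aes(): a fixed attribute of the layer
fixed <- ggplot(penguins) +
  geom_point(aes(body_mass_g, bill_depth_mm), colour = "blue")

unique(layer_data(mapped)$colour) # a default scale colour, not "blue"
unique(layer_data(fixed)$colour)  # "blue"
```

The mapped version also gets a legend with a single entry labelled blue, which is a quick visual clue that a constant ended up inside aes().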

group in ggplot


ggplot(
  penguins,
  aes(
    body_mass_g,
    bill_depth_mm
  )
) +
  geom_point(aes(colour = species)) +
  geom_smooth(method = "lm")

  • What if we want to draw the fit of a simple linear model for each cluster?

group in ggplot (cont.)


ggplot(
  penguins,
  aes(
    body_mass_g,
    bill_depth_mm
  )
) +
  geom_point(aes(colour = species)) +
  geom_smooth(
    method = "lm",
    aes(group = species == "Gentoo")
  )

Note

  • Notice that if you do
geom_smooth(method = "lm", aes(group = species))

it will give three separate regression lines, one per species.
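
As a rough check of how the group aesthetic changes the number of fitted lines, the sketch below counts the groups in the smooth layer for each variant. n_fits() is a helper name made up here, and formula = y ~ x is spelled out only to silence the message geom_smooth() otherwise prints.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(body_mass_g, bill_depth_mm)) +
  geom_point(aes(colour = species))

# Hypothetical helper: number of separate fits drawn by the smooth layer (layer 2)
n_fits <- function(plot) {
  length(unique(layer_data(plot, 2)$group))
}

# No group aesthetic in the smooth layer: one overall fit
one_line <- base + geom_smooth(method = "lm", formula = y ~ x)

# Gentoo versus the rest
two_lines <- base +
  geom_smooth(method = "lm", formula = y ~ x, aes(group = species == "Gentoo"))

# One fit per species
three_lines <- base +
  geom_smooth(method = "lm", formula = y ~ x, aes(group = species))

c(n_fits(one_line), n_fits(two_lines), n_fits(three_lines))
```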

Facetting


g <- ggplot(
  penguins,
  aes(bill_length_mm, bill_depth_mm, 
      colour = species)
) +
  geom_point()
g

Facetting (cont.)

  • If you have only one variable with many levels:
g + facet_wrap(~sex)

Facetting (cont.)

  • If you have two discrete variables:
g + facet_grid(island ~ sex)
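
Note that sex contains missing values in penguins (see the glimpse() output earlier), so faceting by sex produces an extra NA panel. One possible way to avoid this, sketched here with penguins_sexed and g_sexed as made-up names, is to drop those rows before plotting:

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Keep only penguins whose sex was recorded
penguins_sexed <- filter(penguins, !is.na(sex))

g_sexed <- ggplot(
  penguins_sexed,
  aes(bill_length_mm, bill_depth_mm, colour = species)
) +
  geom_point()

g_sexed + facet_wrap(~sex)         # one panel per sex
g_sexed + facet_grid(island ~ sex) # islands in rows, sexes in columns
```

Whether dropping these rows is appropriate depends on the question being asked; the NA panel is sometimes informative in its own right.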

Your mission, should you choose to accept it!

  • This question is inspired by Alison Hill’s presentation: “Plot Twist. 10 Bake Offs, 11 Ways”.
  • We are interested in visualising the ratings of “The Great British Bake Off” TV series.
  • In front of you, you should have a printout of a few datasets/tibbles.
  • Your task is to match the tibbles to the plots and also answer a few questions about the plots.
  • Please have a discussion within your group to generate/confirm your ideas.

Recipe 1: Questions

  • Which dataset?
  • Which geom?
  • What variable is mapped on the x-axis?
  • What variable is mapped on the y-axis?
  • What variable is mapped to the colour of the bars?


Recipe 2: Questions

  • Which dataset?
  • Which geom?
  • What variable is grouped?
  • What variable is mapped on the x-axis?
  • What variable is mapped on the y-axis?
  • What variable is mapped to colour?


Recipe 3: Questions

  • Which dataset?
  • Which geom?
  • What variable is facetted?
  • What variable is grouped?
  • What variable is mapped on the x-axis?
  • What variable is mapped on the y-axis?
  • What variable is mapped to colour?


Recipe 4: Questions

  • Which dataset?
  • Which geoms?
  • What variable is grouped?
  • What variable is mapped on the x-axis?
  • What variable is mapped on the y-axis?
  • What variable is mapped to colour?


Recipe 5: Questions

  • Which dataset?
  • Which geoms?
  • What variable is grouped?
  • What variable is mapped on the x-axis?
  • What variable is mapped on the y-axis?
  • What variable is mapped to colour?


Recipe 6: Questions

  • Which dataset?
  • Which geom?
  • What variable is mapped on the x-axis?
  • What variable is mapped on the y-axis?
  • What variable is mapped to colour?


The patchwork package

Artwork by @allison_horst
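
The patchwork package of Pedersen (2022) combines separate ggplot2 plots into one figure. A minimal sketch, assuming patchwork is installed and using two example plots made up for this illustration:

```r
library(ggplot2)
library(patchwork)
library(palmerpenguins)

p1 <- ggplot(penguins, aes(bill_length_mm, bill_depth_mm, colour = species)) +
  geom_point()
p2 <- ggplot(penguins, aes(species, bill_length_mm)) +
  geom_boxplot()

side_by_side <- p1 + p2 # plots placed next to each other
stacked <- p1 / p2      # plots placed on top of each other

side_by_side
```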

How to get your plots out of RStudio?

  • At this stage, these beautiful plots are not much use if they are stuck in RStudio.
  • The easiest way to export your plots and save them elsewhere on your computer is by using ggsave().
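
A minimal sketch of ggsave() is given below. The file name, the use of tempdir() and the plot size are assumptions made for this illustration; in practice you would point out_file at a folder in your own project.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(bill_length_mm, bill_depth_mm, colour = species)) +
  geom_point()

# Hypothetical output location; replace with a path in your own project
out_file <- file.path(tempdir(), "penguins_scatter.pdf")

# The file extension decides the format (e.g. .pdf, .png);
# width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

If the plot argument is left out, ggsave() saves the last plot that was displayed.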

Think for Yourself, Question Authority

  • Visualisation is a powerful tool for communicating data.
  • But it can also be used to mislead others.
  • What’s wrong with this visualisation?

Think for Yourself, Question Authority (cont.)

  • FAFSA = Free Application for Federal Student Aid.
  • A form for college students to determine their eligibility for student financial aid in the US.
  • What is your first impression of this plot?

  • But in reality:

Your mission, should you choose to accept it!

  • Now it’s your turn to create a few basic plots.
  • Try to predict what the code will do before running it.
    • You can find the code inside module02_exercises.html.
  • Just give us a yell if you have any questions.

Combining the skills learnt so far

Recipe 1 Plot

# create coordinates for labels
series_labels <- dat1 %>%
  group_by(series) %>%
  summarize(
    y_position = median(viewers_7day) + 1,
    x_position = mean(episode_count)
  )
# make the plot
ggplot(dat1, aes(x = episode_count, y = viewers_7day, fill = series)) +
  geom_col(alpha = .9) +
  ggtitle("Series 8 was a Big Setback in Viewers",
    subtitle = "7-Day Viewers across All Series/Episodes"
  ) +
  geom_text(data = series_labels, aes(
    label = series,
    x = x_position,
    y = y_position
  )) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  ) +
  scale_x_discrete(expand = c(0, 0))

Recipe 2 Plot

# create coordinates for labels: the last row of each series
line_labels <- ratings %>%
  group_by(series) %>%
  mutate(episode = as.numeric(episode)) %>%
  slice_tail(n = 1) %>%
  select(series, x_position = episode, y_position = viewers_7day)

ggplot(ratings, aes(
  x = as.numeric(episode),
  y = viewers_7day,
  colour = series,
  group = series
)) +
  geom_line() +
  labs(colour = "Series", x = "Episode") +
  geom_text(data = line_labels, aes(
    label = series,
    x = x_position + .25,
    y = y_position
  ))

Recipe 3 Plot

ggplot(ratings, aes(
  x = episode,
  y = viewers_7day,
  colour = fct_reorder2(series, episode, viewers_7day),
  group = series
)) +
  facet_wrap(~series) +
  geom_line(lwd = 2) +
  labs(colour = "Series", x = "Episode")

Recipe 4 Plot

# code for plot
ggplot(dat2, aes(
  x = series,
  y = viewers_7day,
  colour = fct_reorder2(episode, series, viewers_7day),
  group = episode
)) +
  geom_point() +
  geom_line() +
  ggtitle("Great British Bake Off Finales Get More Viewers than Premieres") +
  labs(colour = "Episode")

Recipe 5 Plot

# create coordinates for labels: the "last" episode of each series
slope_labels <- dat2 %>%
  filter(episode == "last") %>%
  select(series, x_position = episode, y_position = viewers_7day)
ggplot(dat2,
  aes(
    x = episode,
    y = viewers_7day,
    colour = series,
    group = series
  )
) +
  geom_point() +
  geom_line() +
  geom_text(
    data = slope_labels, aes(
      label = series,
      x = x_position,
      y = y_position
    ),
    nudge_x = .1
  ) +
  theme(
    panel.grid = element_blank(),
    axis.line = element_line(colour = "gray")
  )

Recipe 6 Plot

# plot
ggplot(dat3, aes(
  x = fct_rev(series),
  y = finale_bump
)) + 
  geom_col(alpha = .7) +
  coord_flip() +
  labs(x = "Series", y = "Difference in Viewers for Finale from Premiere (millions)") +
  ggtitle("Finale 'Bumps' were Smallest for Series 10",
    subtitle = "Finale 7-day Viewers Relative to Premiere"
  )

Skills

  • Data visualisation with ggplot2
  • Combine the skills learnt so far.

References

Horst, Allison, Alison Hill, and Kristen Gorman. 2022. “Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data.” https://CRAN.R-project.org/package=palmerpenguins.
Pedersen, Thomas Lin. 2022. “Patchwork: The Composer of Plots.” https://CRAN.R-project.org/package=patchwork.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and RStudio. 2022. “Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics.” https://CRAN.R-project.org/package=ggplot2.